We propose using Hierarchical Clustering; with our model, we segmented the customers into 6 clusters. Here is a quick summary of each cluster:
Through our EDA and data preprocessing, we calculated a few extra metrics, including total household size, average purchase price, total spent on needs, and total spent on luxuries. The majority of customers in our system make very few purchases, skewing our data to the right. Spending on necessities shows a long, thin tail, while spending on luxuries shows a more substantial one. Not surprisingly, income appears to be our most influential metric.
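The right-skew claim above can be checked numerically with pandas' `skew()`; a minimal sketch on synthetic spending values (illustrative data, not our dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed spending column (illustrative values only)
spend = pd.Series(rng.exponential(scale=50, size=1000))

# A clearly positive skew indicates a long right tail
print(round(spend.skew(), 2))
```

A value well above zero confirms the long right tail; a value near zero would indicate a symmetric distribution.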
Use our model to tailor your marketing strategy for each cluster.
In sports, a divide has grown between the up-and-coming data analytics crowd and the old-school scouts. The old-school scouts say that analytics don't account for the "Eye Test": the evaluation of how an athlete performs visually versus how they appear in the stats. But that isn't even the question we should be asking; the right question is "how can analytics improve the eye test?" A trained scout's eye can catch many things that statistics miss, but the power of statistics is that stats "watch" every single game. That is the problem we are trying to solve right now: we need to understand each and every one of our customers to use our resources efficiently. We, as analysts, can look at demographics and see what kinds of people buy what kinds of products, but there are many underlying trends that humans can't see. We need algorithms that can comb through the data and identify patterns in customer behavior.
The goal is to analyze our customer data in order to segment the customers into distinct clusters. We will then profile each cluster to understand how each kind of customer interacts with our brand.
Who are our customers overall? What are the general characteristics of our customer base? Which variables are most important in predicting customer behavior? What additional information can we extract from our dataset? How many clusters should we segment into? What kind of customer makes up each cluster and how do those customers interact with our brand?
We are using data science to find patterns in our customers' behavior and identify similar customers who are likely to behave the same way. Using algorithms to identify similar customers is far easier than playing a 2,240-piece game of Memory to match customers up. Data science will also identify the most important variables and determine the weight each carries in our data.
The overarching problem we are trying to solve is how to better serve our customers and attract more of them, with the ultimate goal of growing our business. We have data about our customers and their actions; we need to better understand who our customers are and how they behave so that we can use our resources most effectively to serve them.
Our solution will provide you with a full understanding of your customer base. You will have detailed profiles of customer archetypes and their behaviors. This will allow you to target your marketing efforts to different clusters, run A/B testing to see which types of ads work on which types of customers, and minimize wasted resources.
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import cdist, pdist
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('marketing_campaign.csv')
data_copy = data.copy()
data.shape
(2240, 27)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
# Convert to date time
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format ='%d-%m-%Y')
# Check to see if there are duplicates
data[data.duplicated()]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 27 columns
# Drop the rows with null values in the income column
data_no_null = data.dropna()
data_no_null.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2216 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2216 non-null   int64
 1   Year_Birth           2216 non-null   int64
 2   Education            2216 non-null   object
 3   Marital_Status       2216 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2216 non-null   int64
 6   Teenhome             2216 non-null   int64
 7   Dt_Customer          2216 non-null   datetime64[ns]
 8   Recency              2216 non-null   int64
 9   MntWines             2216 non-null   int64
 10  MntFruits            2216 non-null   int64
 11  MntMeatProducts      2216 non-null   int64
 12  MntFishProducts      2216 non-null   int64
 13  MntSweetProducts     2216 non-null   int64
 14  MntGoldProds         2216 non-null   int64
 15  NumDealsPurchases    2216 non-null   int64
 16  NumWebPurchases      2216 non-null   int64
 17  NumCatalogPurchases  2216 non-null   int64
 18  NumStorePurchases    2216 non-null   int64
 19  NumWebVisitsMonth    2216 non-null   int64
 20  AcceptedCmp3         2216 non-null   int64
 21  AcceptedCmp4         2216 non-null   int64
 22  AcceptedCmp5         2216 non-null   int64
 23  AcceptedCmp1         2216 non-null   int64
 24  AcceptedCmp2         2216 non-null   int64
 25  Complain             2216 non-null   int64
 26  Response             2216 non-null   int64
dtypes: datetime64[ns](1), float64(1), int64(23), object(2)
memory usage: 484.8+ KB
Questions:
# Drop the ID Column, it will not be helpful in our analysis
data_dropped = data_no_null.drop(columns = 'ID')
# Identify all the numeric columns
num_cols = data_dropped.select_dtypes(include=['number'])
# Separate the non-response numeric columns from the binary response columns
num_cols_no_response = num_cols.iloc[:,:16]
response_cols = num_cols[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4',
'AcceptedCmp5', 'Complain', 'Response']]
# Identify the categorical columns
cat_cols = data_dropped.select_dtypes(include=['object'])
# Check summary statistics of numeric columns
round(data_dropped.describe().T,2)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2216.0 | 1968.82 | 11.99 | 1893.0 | 1959.0 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.25 | 25173.08 | 1730.0 | 35303.0 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2216.0 | 0.44 | 0.54 | 0.0 | 0.0 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2216.0 | 0.51 | 0.54 | 0.0 | 0.0 | 0.0 | 1.00 | 2.0 |
| Recency | 2216.0 | 49.01 | 28.95 | 0.0 | 24.0 | 49.0 | 74.00 | 99.0 |
| MntWines | 2216.0 | 305.09 | 337.33 | 0.0 | 24.0 | 174.5 | 505.00 | 1493.0 |
| MntFruits | 2216.0 | 26.36 | 39.79 | 0.0 | 2.0 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2216.0 | 167.00 | 224.28 | 0.0 | 16.0 | 68.0 | 232.25 | 1725.0 |
| MntFishProducts | 2216.0 | 37.64 | 54.75 | 0.0 | 3.0 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2216.0 | 27.03 | 41.07 | 0.0 | 1.0 | 8.0 | 33.00 | 262.0 |
| MntGoldProds | 2216.0 | 43.97 | 51.82 | 0.0 | 9.0 | 24.5 | 56.00 | 321.0 |
| NumDealsPurchases | 2216.0 | 2.32 | 1.92 | 0.0 | 1.0 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2216.0 | 4.09 | 2.74 | 0.0 | 2.0 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2216.0 | 2.67 | 2.93 | 0.0 | 0.0 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2216.0 | 5.80 | 3.25 | 0.0 | 3.0 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2216.0 | 5.32 | 2.43 | 0.0 | 3.0 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2216.0 | 0.07 | 0.26 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2216.0 | 0.07 | 0.26 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2216.0 | 0.07 | 0.26 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2216.0 | 0.06 | 0.24 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2216.0 | 0.01 | 0.11 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| Complain | 2216.0 | 0.01 | 0.10 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
| Response | 2216.0 | 0.15 | 0.36 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 |
# Check summary statistics of categorical columns
print(cat_cols.describe(),'\n')
print(cat_cols.value_counts())
Education Marital_Status
count 2216 2216
unique 5 8
top Graduation Married
freq 1116 857
Education Marital_Status
Graduation Married 429
Together 285
Single 246
PhD Married 190
Master Married 138
Graduation Divorced 119
PhD Together 116
Master Together 102
PhD Single 96
2n Cycle Married 80
Master Single 75
2n Cycle Together 56
PhD Divorced 52
Master Divorced 37
2n Cycle Single 36
Graduation Widow 35
PhD Widow 24
2n Cycle Divorced 23
Basic Married 20
Single 18
Together 14
Master Widow 11
2n Cycle Widow 5
PhD YOLO 2
Master Alone 1
Absurd 1
Graduation Alone 1
Absurd 1
Basic Widow 1
PhD Alone 1
Basic Divorced 1
dtype: int64
# Replace 2n cycle with master
data_dropped.loc[data_dropped['Education'] == '2n Cycle', 'Education'] = 'Master'
# Replace all alone/YOLO/Absurd data as single
data_dropped.loc[((data_dropped['Marital_Status'] == 'Alone') |
(data_dropped['Marital_Status'] == 'YOLO')|
(data_dropped['Marital_Status'] == 'Absurd')), 'Marital_Status'] = 'Single'
cat_cols_cleaned = data_dropped.select_dtypes(include=['object'])
print(cat_cols_cleaned.describe(),'\n')
print(cat_cols_cleaned.value_counts())
Education Marital_Status
count 2216 2216
unique 4 5
top Graduation Married
freq 1116 857
Education Marital_Status
Graduation Married 429
Together 285
Single 248
Master Married 218
PhD Married 190
Master Together 158
Graduation Divorced 119
PhD Together 116
Master Single 113
PhD Single 99
Master Divorced 60
PhD Divorced 52
Graduation Widow 35
PhD Widow 24
Basic Married 20
Single 18
Master Widow 16
Basic Together 14
Widow 1
Divorced 1
dtype: int64
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
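As a minimal sketch of what this looks like in pandas (toy values, not our dataset): `describe()` covers range and central tendency for a numeric column, and `value_counts()` covers frequencies for a categorical one.

```python
import pandas as pd

# Toy stand-ins for one numeric and one categorical column
toy = pd.DataFrame({
    'Income': [30000, 45000, 52000, 61000, 88000],
    'Education': ['Basic', 'Graduation', 'Graduation', 'Master', 'PhD'],
})

# Numeric: count, mean, std, min/max, and quartiles
print(toy['Income'].describe())

# Categorical: frequency of each level
print(toy['Education'].value_counts())
```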
Leading Questions:
# Define the subplot grid: one row per variable, with a box plot and a histogram side by side
n_cols = 2
n_rows = len(num_cols_no_response.columns)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 3))
# Plot box plots and histograms
for i, column in enumerate(num_cols_no_response.columns):
    # Box plot
    sns.boxplot(x=data_dropped[column], ax=axes[i, 0])
    axes[i, 0].set_title(f'Box Plot {column}')
    # Histogram
    sns.histplot(data_dropped[column], kde=False, ax=axes[i, 1])
    axes[i, 1].set_title(f'Histogram {column}')
plt.tight_layout()
# Show the plot
plt.show()
# Drop the major outliers in Income and Year of Birth
data_dropped = data_dropped[(data_dropped['Income'] <= 600000) & (data_dropped['Year_Birth'] > 1915)]
# Define the columns and rows needed to plot the graphs
n_cols = len(cat_cols_cleaned.columns)
n_rows = 1
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10,5))
# Plot box plots and histograms
for i, column in enumerate(cat_cols_cleaned.columns):
cat_cols_cleaned[column].value_counts().plot(kind='bar', ax=axes[i])
axes[i].set_title(f'{column} Bar Plot')
axes[i].set_xlabel('Categories')
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# Create an empty data frame
offers_vcs = pd.DataFrame()
# Create a for loop to add the value counts for customers who interacted with brand marketing campaigns
for column in response_cols.columns:
offers_vcs[column] = response_cols[column].value_counts()
# Plot the total number of acceptances for each campaign
response_cols.sum().plot(kind='bar')
plt.xlabel('Campaign')
plt.ylabel('Number of Acceptances')
plt.title('Number of Times Customers Responded to Offers')
plt.show()
print(offers_vcs)
   AcceptedCmp1  AcceptedCmp2  AcceptedCmp3  AcceptedCmp4  AcceptedCmp5
0          2096          2211          2077          2073          2077
1           144            29           163           167           163

   Complain  Response
0      2219      1906
1        21       334
# Create a heatmap showing the correlation between all numeric variables
plt.figure(figsize = (12, 10))
sns.heatmap(data_dropped[num_cols_no_response.columns].corr(), annot = False, cmap = "rocket")
plt.show()
#Create a dataset excluding income so we can compare the rest of the variables to income
all_col = num_cols_no_response.columns
except_income = [col for col in all_col if col != 'Income']
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
# Create scatter plots for each variable
for i, variable in enumerate(except_income):
plt.subplot(n_rows, n_cols, i + 1)
sns.scatterplot(data=data_dropped, x='Income', y=variable)
plt.title(variable)
plt.tight_layout()
plt.show()
data_dropped = data_dropped[(data_dropped['Income'] <= 150000)]
In this section, we will first prepare our dataset for analysis.
Think About It:
# Create a new dataframe to add additional columns
df_feat_eng = data_dropped.copy()
# Determine age by subtracting birth year from the collection year (2016, per the data note)
df_feat_eng['Age'] = 2016 - df_feat_eng['Year_Birth']
# Determine the amount of children in a home by adding kids and teens
df_feat_eng['Dependents'] = df_feat_eng['Kidhome'] + df_feat_eng['Teenhome']
# Create a function to calculate the total household size
def calc_household(row):
    # Married/Together households have two adults; all others have one
    if row['Marital_Status'] in ('Married', 'Together'):
        return row['Dependents'] + 2
    else:
        return row['Dependents'] + 1
# Apply the function to the whole dataset and plot the counts
df_feat_eng['Total_Household'] = df_feat_eng.apply(calc_household, axis=1)
df_feat_eng['Total_Household'].value_counts().plot(kind='bar')
plt.show()
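The `calc_household` logic can also be written without `apply`, using a vectorized `numpy.where`; a sketch on toy rows (the column names mirror ours, the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Marital_Status': ['Married', 'Single', 'Together', 'Divorced'],
    'Dependents': [2, 0, 1, 1],
})

# Two adults for Married/Together households, one adult otherwise
adults = np.where(df['Marital_Status'].isin(['Married', 'Together']), 2, 1)
df['Total_Household'] = df['Dependents'] + adults
print(df['Total_Household'].tolist())  # [4, 1, 3, 2]
```

Vectorized operations like this avoid the per-row Python overhead of `apply` and scale better on larger frames.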
# Add together all purchases that are necessities
df_feat_eng['total_spent_needs'] = (df_feat_eng['MntFishProducts'] + df_feat_eng['MntMeatProducts']
+ df_feat_eng['MntFruits'])
# Add together all purchases that are luxuries
df_feat_eng['total_spent_luxuries'] = (df_feat_eng['MntSweetProducts'] + df_feat_eng['MntWines']
+ df_feat_eng['MntGoldProds'])
# See how many total offers a customer has accepted
df_feat_eng['total_accepted_offers'] = (df_feat_eng['AcceptedCmp1'] + df_feat_eng['AcceptedCmp2']
+ df_feat_eng['AcceptedCmp3'] + df_feat_eng['AcceptedCmp4']
+ df_feat_eng['AcceptedCmp5'])
# Count the number of transactions the customer has made over the various mediums
df_feat_eng['total_amt_purchases'] = (df_feat_eng['NumCatalogPurchases'] + df_feat_eng['NumStorePurchases']
+ df_feat_eng['NumWebPurchases'])
# Calculate the average purchase price by adding all spending and dividing by count of transactions
df_feat_eng['avg_purchase_price'] = round((df_feat_eng['total_spent_needs'] + df_feat_eng['total_spent_luxuries'])
/ df_feat_eng['total_amt_purchases'],2)
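One caveat: a customer with zero recorded purchases would make this ratio divide by zero, and pandas silently returns `inf`. A guarded variant (sketch with toy totals, not our actual columns):

```python
import numpy as np
import pandas as pd

total_spent = pd.Series([300.0, 0.0, 150.0])
total_purchases = pd.Series([10, 0, 6])

# Map a zero purchase count to NaN so the ratio becomes NaN instead of inf
avg_price = (total_spent / total_purchases.replace(0, np.nan)).round(2)
print(avg_price.tolist())  # [30.0, nan, 25.0]
```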
# Calculate how many days customers have been enrolled
# Use a fixed reference date consistent with the 2016 collection assumption,
# so the feature is reproducible (datetime.now() would change on every run)
reference_date = datetime(2016, 12, 31)
df_feat_eng['days_enrolled'] = (reference_date - df_feat_eng['Dt_Customer']).dt.days
# Create a dataframe of just the new columns we made to look at their distributions
new_cols = df_feat_eng[['total_spent_needs', 'total_spent_luxuries', 'total_amt_purchases',
'avg_purchase_price','total_accepted_offers', 'days_enrolled']]
# Define the subplot grid: one row per variable, with a box plot and a histogram side by side
n_cols = 2
n_rows = len(new_cols.columns)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 3))
# Plot box plots and histograms
for i, column in enumerate(new_cols.columns):
# Box plot
sns.boxplot(x=new_cols[column], ax=axes[i, 0])
axes[i, 0].set_title(f'Box Plot {column}')
# Histogram
sns.histplot(new_cols[column], kde=False, ax=axes[i, 1])
axes[i, 1].set_title(f'Histogram {column}')
plt.tight_layout()
# Show the plot
plt.show()
What are the most important observations and insights from the data, based on the EDA and data preprocessing performed?
# Create a dataframe of the behavioral variables that will serve as segmentation attributes
seg_att = df_feat_eng[['NumDealsPurchases', 'NumWebVisitsMonth','Complain', 'Response',
'total_spent_needs', 'total_spent_luxuries','total_accepted_offers',
'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases']]
# Create a dataframe of the characteristic variables that will serve as our profiling attributes.
prof_att = df_feat_eng[['Education', 'Marital_Status', 'Income','Age', 'Dependents','Total_Household',
'avg_purchase_price','days_enrolled']]
prof_categorical = prof_att.select_dtypes(include = ['object']).columns
prof_num = prof_att.select_dtypes(include = ['number']).columns
# Scale the data so we can analyze it
scale = StandardScaler()
X = seg_att
scaledX = scale.fit_transform(X)
data_scaled = pd.DataFrame(scaledX, columns = seg_att.columns)
plt.figure(figsize = (12, 10))
sns.heatmap(data_scaled.corr(), annot = False, cmap = "rocket")
plt.show()
# Plot the t-SNE embedding for increasing perplexity values
for i in range(10, 50, 5):
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)
    data_tsne = tsne.fit_transform(scaledX)
    data_tsne = pd.DataFrame(data_tsne, columns = ['X1', 'X2'])
    plt.figure(figsize = (5,5))
    sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne)
    plt.title("perplexity = {}".format(i))
    plt.show()
Observation and Insights:
Think about it:
# Defining the number of principal components to generate
n = scaledX.shape[1]
# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(scaledX))
# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
plt.figure(figsize = (10, 10))
plt.plot(range(1, n+1),pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
# Plot the principal values in relation to the numeric variables
n = 6
pca1 = PCA(n_components = n, random_state = 1)
data_pca = pd.DataFrame(pca1.fit_transform(scaledX))
cols = [f'PC{i+1}' for i in range(n)]
pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:len(cols)], 2), index = seg_att.columns,
columns = cols)
# Create subplots
fig, axs = plt.subplots(len(cols), 1, figsize=(10, 4 * len(cols)))
# Plot each principal component as a separate subplot
for i, col in enumerate(cols):
pc1[col].plot(kind='bar', ax=axs[i])
axs[i].set_title(f'Principal Component Analysis: {col}')
axs[i].set_xlabel('Variables')
axs[i].set_ylabel('Principal Component Values')
axs[i].set_xticklabels(pc1.index, rotation=45)
plt.tight_layout()
plt.show()
Observation and Insights:
I chose 6 principal components because they explain over 90% of the variance; after testing with fewer and with more components, I saw no benefit to going beyond that.
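That 90% cutoff can also be found programmatically rather than by eyeballing the curve; a sketch on synthetic data (not our fitted PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))  # synthetic stand-in for the scaled data

pca = PCA(n_components=10, random_state=1).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_components = int(np.argmax(cumvar >= 0.90)) + 1
print(n_components)
```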
Think About It:
sse = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
for k in range(1, 16):
kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
sse[k] = kmeans.inertia_
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()),'-bo')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
# Empty dictionary to store the Silhouette score for each value of K
sc = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm. Store the Silhouette score for that K
for k in range(2, 16):
kmeans = KMeans(n_clusters = k, random_state = 1).fit(data_pca)
labels = kmeans.predict(data_pca)
sc[k] = silhouette_score(data_pca, labels)
# Silhouette score plot
plt.figure()
plt.plot(list(sc.keys()), list(sc.values()), '-bo')
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.show()
Observation and Insights:
# Define the number of clusters I will use throughout the notebook
n_clusters = 6
# Initialize the KMeans algorithm
kmeans_pca = KMeans(n_clusters = n_clusters, init = 'k-means++', random_state = 1)
# Fit the pca data with kmeans
kmeans_pca.fit(data_pca)
# Add the labels to my data frame
df_feat_eng['KmeansLabels'] = kmeans_pca.labels_
# Define the color palette I want to use throughout the notebook
color_palette = sns.color_palette("colorblind", df_feat_eng['KmeansLabels'].nunique())
# Show the count of each cluster
df_feat_eng['KmeansLabels'].value_counts()
4    921
1    479
0    452
2    173
3    160
5     20
Name: KmeansLabels, dtype: int64
num_pcs = data_pca.shape[1] # Number of principal components
plot_index = 1
plt.figure(figsize=(20,num_pcs * 10))
# Create a for loop to compare all the PCs with each other and visualize the clusters
for i in range(num_pcs):
for j in range(i + 1, num_pcs):
plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j],
hue=df_feat_eng['KmeansLabels'],palette=color_palette)
plt.xlabel(f'PC {i+1}')
plt.ylabel(f'PC {j+1}')
plt.title(f'Principal Components {i+1} vs {j+1}')
plot_index += 1
plt.tight_layout()
plt.show()
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
data_tsne = tsne.fit_transform(scaledX)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (5,5))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['KmeansLabels'],palette=color_palette)
plt.title("perplexity = 15")
plt.show()
# Use the segmentation attributes as the numeric columns to profile
num_col = seg_att
# Set plot size
plt.figure(figsize = (15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
plt.subplot(len(num_col.columns), 2, i + 1)
sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmeansLabels'], order = group_order)
plt.tight_layout()
plt.title(variable)
plt.show()
# Create a list of the variables we want to compare against income
except_income = [col for col in num_col if col != 'Income']
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['KmeansLabels'].nunique())
for i, variable in enumerate(except_income):
plt.subplot(n_rows, n_cols, i + 1)
sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['KmeansLabels'],
palette=color_palette)
plt.title(variable)
plt.tight_layout()
plt.show()
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
for h,i in enumerate(prof_categorical):
grouped_counts = df_feat_eng.groupby(['KmeansLabels',i]).size().reset_index(name='counts')
plt.subplot(n_rows, n_cols, h + 1)
sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'KmeansLabels', palette = color_palette)
plt.xlabel(i)
plt.ylabel('Counts')
plt.title(f'KMeans Cluster Profile by {i}')
plt.legend(title=i)
plt.tight_layout()
plt.show()
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
clusters = sorted(df['KmeansLabels'].unique())
n_clusters = len(clusters)
n_cols = len(columns)
plt.figure(figsize=(15, n_clusters * 5))
for idx, cluster in enumerate(clusters):
for jdx, column in enumerate(columns):
cluster_data = df[df['KmeansLabels'] == cluster]
counts = cluster_data[column].value_counts()
plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')
plt.tight_layout()
plt.show()
# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']
# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
# Set plot size
plt.figure(figsize = (15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
plt.subplot(len(prof_num), 2, i + 1)
sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmeansLabels'], order = group_order)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations and Insights:
The clustering isolated the people who complain into one very small cluster. Age wasn't a very helpful profiling attribute because every cluster has a similar range.
Think About It:
Summary of each cluster:
kmedo = KMedoids(n_clusters = n_clusters, random_state = 1)
kmedo.fit(data_pca)
df_feat_eng['KmedoLabels'] = kmedo.predict(data_pca)
df_feat_eng['KmedoLabels'].value_counts()
1    498
5    497
0    385
4    328
2    309
3    188
Name: KmedoLabels, dtype: int64
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
data_tsne = tsne.fit_transform(scaledX)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (5,5))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['KmedoLabels'],palette=color_palette)
plt.title("perplexity = 15")
plt.show()
# Compare each pair of principal components and visualize the K-Medoids clusters
num_pcs = data_pca.shape[1] # Number of principal components
plot_index = 1
plt.figure(figsize=(20,num_pcs * 10))
for i in range(num_pcs):
for j in range(i + 1, num_pcs):
plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j],
hue=df_feat_eng['KmedoLabels'],palette=color_palette)
plt.xlabel(f'PC {i+1}')
plt.ylabel(f'PC {j+1}')
plt.title(f'Principal Components {i+1} vs {j+1}')
plot_index += 1
plt.tight_layout()
plt.show()
# Set plot size
plt.figure(figsize = (15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
plt.subplot(len(num_col.columns), 2, i + 1)
sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmedoLabels'], order = group_order)
plt.tight_layout()
plt.title(variable)
plt.show()
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['KmedoLabels'].nunique())
for i, variable in enumerate(except_income):
plt.subplot(n_rows, n_cols, i + 1)
sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['KmedoLabels'],
palette=color_palette)
plt.title(variable)
plt.tight_layout()
plt.show()
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
for h,i in enumerate(prof_categorical):
grouped_counts = df_feat_eng.groupby(['KmedoLabels',i]).size().reset_index(name='counts')
plt.subplot(n_rows, n_cols, h + 1)
sns.barplot(data = grouped_counts, x=i, y = 'counts', hue = 'KmedoLabels', palette = color_palette)
plt.xlabel(i)
plt.ylabel('Counts')
plt.title(f'K Medoids Cluster Profile by {i}')
plt.legend(title=i)
plt.tight_layout()
plt.show()
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
clusters = sorted(df['KmedoLabels'].unique())
n_clusters = len(clusters)
n_cols = len(columns)
plt.figure(figsize=(15, n_clusters * 5))
for idx, cluster in enumerate(clusters):
for jdx, column in enumerate(columns):
cluster_data = df[df['KmedoLabels'] == cluster]
counts = cluster_data[column].value_counts()
plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')
plt.tight_layout()
plt.show()
# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']
# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
# Set plot size
plt.figure(figsize = (15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
plt.subplot(len(prof_num), 2, i + 1)
sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['KmedoLabels'], order = group_order)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations and Insights:
K-Medoids distributed customers more evenly across the clusters, but the clusters themselves appear less distinct. It also spread the complainers evenly among the other clusters.
Summary for each cluster:
Observations and Insights:
hc_df = data_pca.copy()
hc_df1 = hc_df.copy()
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df1, metric = dm, method = lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.6868745753666708.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7301698171979875.
Cophenetic correlation for Euclidean distance and average linkage is 0.8611556779862961.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.7951552210547905.
Cophenetic correlation for Chebyshev distance and single linkage is 0.642375507474261.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6057135793014059.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8586546275910416.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.7252606349299516.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6800460908323281.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.664531085942267.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8221753980918977.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7771600130378179.
Cophenetic correlation for Cityblock distance and single linkage is 0.733563535927652.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7125179768514398.
Cophenetic correlation for Cityblock distance and average linkage is 0.8524722661820792.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8166686006433075.
****************************************************************************************************
Highest cophenetic correlation is 0.8611556779862961, which is obtained with Euclidean distance and average linkage.
# List of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# Enumerate through the list of linkage methods above;
# for each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df1, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hc_df1))  # compare against the same data the linkage was built on
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
# Create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient", ascending=False)
df_cc
|   | Linkage | Cophenetic Coefficient |
|---|---|---|
| 2 | average | 0.861156 |
| 3 | centroid | 0.853083 |
| 5 | weighted | 0.795155 |
| 1 | complete | 0.730170 |
| 0 | single | 0.686875 |
| 4 | ward | 0.564658 |
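The table ranks average linkage highest by cophenetic coefficient, yet the model fitted next uses ward linkage. One hedged sanity check for that choice is to compare silhouette scores of the two linkages at the chosen cluster count; the sketch below does this on synthetic stand-in data (in the notebook, `hc_df1` would be used instead).

```python
# Compare silhouette scores of ward vs. average linkage at 6 clusters.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Stand-in data: six loose blobs along the diagonal of a 3-D feature space
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 3)) for c in range(6)])

scores = {}
for link in ["ward", "average"]:
    labels = AgglomerativeClustering(n_clusters=6, linkage=link).fit_predict(X)
    scores[link] = silhouette_score(X, labels)
    print(f"{link}: silhouette = {scores[link]:.3f}")
```

Cophenetic correlation measures how faithfully the dendrogram preserves pairwise distances, while silhouette measures cluster compactness and separation, so the two criteria can legitimately disagree.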
# Although average linkage scored highest on cophenetic correlation, ward linkage is used here,
# as it tends to produce more balanced, interpretable clusters
HCmodel = AgglomerativeClustering(n_clusters=n_clusters, metric="euclidean", linkage="ward")
HCmodel.fit(hc_df1)
AgglomerativeClustering(metric='euclidean', n_clusters=6)
# Adding hierarchical cluster labels to the original and whole dataframes
hc_df['HCLabels'] = HCmodel.labels_
df_feat_eng['HCLabels'] = HCmodel.labels_
df_feat_eng['HCLabels'].value_counts()
3    795
0    699
1    417
2    189
4     85
5     20
Name: HCLabels, dtype: int64
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
data_tsne = tsne.fit_transform(scaledX)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (5,5))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['HCLabels'],palette=color_palette)
plt.title("perplexity = 15")
plt.show()
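The t-SNE view above is computed at a single perplexity, and t-SNE layouts can change noticeably with that setting. A hedged sketch for comparing several perplexities side by side (using synthetic stand-in data in place of `scaledX`):

```python
# Embed the same data at several perplexities so cluster separation can be
# compared across settings before trusting any single t-SNE picture.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
scaledX = rng.normal(size=(60, 5))  # stand-in for the real scaled feature matrix

embeddings = {}
for perp in (5, 15, 30):  # perplexity must stay below the number of samples
    tsne = TSNE(n_components=2, random_state=1, perplexity=perp)
    embeddings[perp] = tsne.fit_transform(scaledX)
```

Each entry of `embeddings` can then be scatter-plotted with the cluster labels as hue, exactly as in the cell above.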
# data_pca holds the principal components and df_feat_eng['HCLabels'] is the label column
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1
plt.figure(figsize=(20, num_pcs * 10))
for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j],
                        hue=df_feat_eng['HCLabels'], palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
plt.tight_layout()
plt.show()
# Set plot size
plt.figure(figsize=(15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['HCLabels'], order=group_order)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['HCLabels'].nunique())
for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['HCLabels'],
                    palette=color_palette)
    plt.title(variable)
plt.tight_layout()
plt.show()
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
for h, i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['HCLabels', i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data=grouped_counts, x=i, y='counts', hue='HCLabels', palette=color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'Hierarchical Clustering Profile by {i}')
    plt.legend(title='HCLabels')  # the legend shows cluster labels, not the category values
plt.tight_layout()
plt.show()
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['HCLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    plt.figure(figsize=(15, n_clusters * 5))
    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['HCLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')
    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']
# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
# Set plot size
plt.figure(figsize=(15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['HCLabels'], order=group_order)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
Hierarchical clustering produced some identifiable clusters. Cluster 2 appears to be well-educated single parents. Cluster 4 looks like dual-income, no-kids households who like to spend money. Cluster 1 may be more average families looking to get the most bang for their buck.
Summary of each cluster:
DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best hyperparameters to use with it. It has two: eps and min_samples.
Because it is an unsupervised algorithm, there is no validation set to test against as there would be in supervised learning. The practical approach is to try a range of hyperparameter combinations and compute the silhouette score for each.
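Beyond a silhouette grid search, a common heuristic for choosing eps is the k-distance curve: sort every point's distance to its k-th nearest neighbour (with k equal to min_samples) and look for the "elbow", which suggests a reasonable eps. A hedged sketch on synthetic stand-in data (in the notebook, `data_pca` would be used):

```python
# k-distance heuristic for picking DBSCAN's eps.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))  # stand-in for data_pca

min_samples = 20
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])  # each point's distance to its min_samples-th neighbour

# The elbow of this sorted curve is a candidate eps; here we print a few
# quantiles as a rough numeric summary instead of plotting.
for q in (0.5, 0.9, 0.99):
    print(f"{q:.0%} of points have k-distance <= {np.quantile(k_distances, q):.2f}")
```

Plotting `k_distances` against the sorted point index makes the elbow easier to spot visually.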
dbscan_df = data_pca.copy()
dbscan_df1 = dbscan_df.copy()
# Initializing lists
eps_value = [2, 3]  # Candidate eps values
min_sample_values = [6, 20]  # Candidate min_samples values
# Creating a dictionary pairing each eps value with the min_samples candidates
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0  # Tracks the best silhouette score seen so far
high_i_j = [0, 0]  # Tracks the (eps, min_samples) pair that produced it
key = res.keys()  # Dictionary keys (eps values)
for i in key:
    z = res[i]  # min_samples candidates for this eps
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(dbscan_df)  # Applying DBSCAN to each combination
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(dbscan_df, labels)  # Finding silhouette score
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,  # Printing the silhouette score for each combination
        )
        if high_silhouette_avg < silhouette_avg:  # Keep the best score and its (eps, min_samples) pair
            high_silhouette_avg = silhouette_avg  # Bug fix: the best score itself must also be updated
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.6287446447705786
For eps value =2 For min sample =20 The average silhouette_score is : 0.6179350807374098
For eps value =3 For min sample =6 The average silhouette_score is : 0.659899134648781
For eps value =3 For min sample =20 The average silhouette_score is : 0.6769662970660738
# Printing the highest silhouette score
print(
    "Highest silhouette_avg is {} for eps = {} and min sample = {}".format(
        high_silhouette_avg, high_i_j[0], high_i_j[1]
    )
)
Highest silhouette_avg is 0.6769662970660738 for eps = 3 and min sample = 20
# Applying DBSCAN with eps as 3 and min sample as 20
dbs = DBSCAN(eps = 3, min_samples = 20)
# Add DBSCAN cluster labels to dbscan data
df_feat_eng["DBLabels"] = dbs.fit_predict(dbscan_df1)
# Add DBSCAN cluster labels to whole data
dbscan_df1['DBLabels'] = dbs.fit_predict(dbscan_df)
df_feat_eng['DBLabels'].value_counts()
 0    2185
-1      27
Name: DBLabels, dtype: int64
Observations and Insights: DBSCAN is not usable here: it found only a single cluster of 2,185 customers, with the remaining 27 points labeled -1, the label DBSCAN reserves for noise.
Summary of each cluster: N/A
gmm_df = data_pca.copy()
# Let's apply Gaussian Mixture
gmm = GaussianMixture(n_components = n_clusters, random_state = 1) # Initializing the Gaussian Mixture algorithm with n_components = n_clusters (6)
gmm.fit(gmm_df)
GaussianMixture(n_components=6, random_state=1)
gmm_df["GMM_segments"] = gmm.predict(gmm_df)
df_feat_eng["GMMLabels"] = gmm.predict(data_pca)
df_feat_eng['GMMLabels'].value_counts()
2    772
0    527
1    505
5    232
3    149
4     20
Name: GMMLabels, dtype: int64
# Plot the tsne values with perplexity = 15
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 15)
data_tsne = tsne.fit_transform(scaledX)
data_tsne = pd.DataFrame(data_tsne)
data_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (5,5))
sns.scatterplot(x = 'X1', y = 'X2', data = data_tsne, hue=df_feat_eng['GMMLabels'],palette=color_palette)
plt.title("perplexity = 15")
plt.show()
# data_pca holds the principal components and df_feat_eng['GMMLabels'] is the label column
num_pcs = data_pca.shape[1]  # Number of principal components
plot_index = 1
plt.figure(figsize=(20, num_pcs * 10))
for i in range(num_pcs):
    for j in range(i + 1, num_pcs):
        plt.subplot((num_pcs * (num_pcs - 1)) // 4 + 1, 2, plot_index)
        sns.scatterplot(data=data_pca, x=data_pca.iloc[:, i], y=data_pca.iloc[:, j],
                        hue=df_feat_eng['GMMLabels'], palette=color_palette)
        plt.xlabel(f'PC {i+1}')
        plt.ylabel(f'PC {j+1}')
        plt.title(f'Principal Components {i+1} vs {j+1}')
        plot_index += 1
plt.tight_layout()
plt.show()
# Set plot size
plt.figure(figsize=(15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(num_col):
    plt.subplot(len(num_col.columns), 2, i + 1)
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['GMMLabels'], order=group_order)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Define the number of rows and columns for the subplots
n_rows = (len(except_income) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
# Set the color palette to colorblind
color_palette = sns.color_palette("colorblind", df_feat_eng['GMMLabels'].nunique())
for i, variable in enumerate(except_income):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.scatterplot(data=df_feat_eng, x='Income', y=variable, hue=df_feat_eng['GMMLabels'],
                    palette=color_palette)
    plt.title(variable)
plt.tight_layout()
plt.show()
n_rows = (len(prof_categorical) + 1) // 2
n_cols = 2
plt.figure(figsize=(15, n_rows * 5))
for h, i in enumerate(prof_categorical):
    grouped_counts = df_feat_eng.groupby(['GMMLabels', i]).size().reset_index(name='counts')
    plt.subplot(n_rows, n_cols, h + 1)
    sns.barplot(data=grouped_counts, x=i, y='counts', hue='GMMLabels', palette=color_palette)
    plt.xlabel(i)
    plt.ylabel('Counts')
    plt.title(f'GMM Cluster Profile by {i}')
    plt.legend(title='GMMLabels')  # the legend shows cluster labels, not the category values
plt.tight_layout()
plt.show()
# Function to plot pie charts
def plot_pie_charts(df, columns, title_prefixes):
    clusters = sorted(df['GMMLabels'].unique())
    n_clusters = len(clusters)
    n_cols = len(columns)
    plt.figure(figsize=(15, n_clusters * 5))
    for idx, cluster in enumerate(clusters):
        for jdx, column in enumerate(columns):
            cluster_data = df[df['GMMLabels'] == cluster]
            counts = cluster_data[column].value_counts()
            plt.subplot(n_clusters, n_cols, idx * n_cols + jdx + 1)
            plt.pie(counts, labels=counts.index, autopct='%1.1f%%', colors=plt.cm.Paired.colors)
            plt.title(f'{title_prefixes[jdx]} for Cluster {cluster}')
    plt.tight_layout()
    plt.show()

# Columns to plot
columns = ['Education', 'Marital_Status']
# Titles for the pie charts
title_prefixes = ['Education', 'Marital_Status']
# Plot pie charts
plot_pie_charts(df_feat_eng, columns, title_prefixes)
# Set plot size
plt.figure(figsize=(15, 30))
# Standardize the order of box plots
group_order = range(n_clusters)
# Create a box plot showing the spread of the data for each group
for i, variable in enumerate(prof_num):
    plt.subplot(len(prof_num), 2, i + 1)
    sns.boxplot(y=df_feat_eng[variable], x=df_feat_eng['GMMLabels'], order=group_order)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations and Insights:
GMM found some interesting clusters. Every method produces a very large cluster of low-income customers who make few purchases with the company; for GMM, that is group 2. GMM identified the high-income, high-spending customers without children (group 5), and it also identified the families who hunt for discounts but don't have especially high incomes (group 0).
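Unlike the hard-assignment methods above, a Gaussian mixture also yields membership probabilities, which can flag borderline customers who sit between two profiles. A hedged sketch on synthetic stand-in data (in the notebook, the fitted `gmm` and `data_pca` would be used):

```python
# Soft cluster assignments from a Gaussian mixture: each row of predict_proba
# is a probability distribution over the mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Stand-in data: three well-separated blobs in 3 dimensions
X = np.vstack([rng.normal(loc=c * 3, size=(50, 3)) for c in range(3)])

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
proba = gmm.predict_proba(X)    # shape (n_samples, n_components)
confidence = proba.max(axis=1)  # probability of each point's assigned cluster

print(f"{(confidence < 0.8).sum()} points are assigned with < 80% confidence")
```

Low-confidence customers could be excluded from cluster-specific campaigns or targeted with broader offers.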
Summary of each cluster:
kmeans = KMeans(n_clusters = n_clusters, random_state = 1, n_init = 'auto') # Initializing K-Means with n_clusters clusters (6) and random_state=1
preds = kmeans.fit_predict((data_pca)) # Fitting and predicting K-Means on data_pca
score = silhouette_score(data_pca, preds) # Calculating the silhouette score
print(score)
0.39032909475551403
kmedoids = KMedoids(n_clusters = n_clusters, random_state = 1) # Initializing K-Medoids with n_clusters clusters (6) and random_state=1
preds = kmedoids.fit_predict((data_pca)) # Fitting and predicting K-Medoids on data_pca
score = silhouette_score(data_pca, preds) # Calculating the silhouette score
print(score)
0.12392534797169114
# Initializing Agglomerative Clustering with Euclidean distance, ward linkage, and n_clusters clusters (6)
HCmodel = AgglomerativeClustering(n_clusters = n_clusters, metric = "euclidean", linkage = "ward")
# Fitting on PCA data
preds = HCmodel.fit_predict(data_pca)
score = silhouette_score(data_pca, preds) # Calculating the silhouette score
print(score)
0.30535809430078065
# Initializing Gaussian Mixture algorithm with n_components = n_clusters (6) and random_state = 1
gmm = GaussianMixture(n_components=n_clusters, random_state=1)
# Fitting and predicting Gaussian Mixture algorithm on data_pca
preds = gmm.fit_predict((data_pca))
# Calculating the silhouette score
score = silhouette_score(data_pca, preds)
# Printing the score
print(score)
0.3320312327281025
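The four silhouette scores printed above can be gathered into a single comparison table; the values below are copied from the outputs of the cells above.

```python
# Collect the silhouette scores of the four methods into one ranked table.
import pandas as pd

scores = {
    "K-Means": 0.39032909475551403,
    "K-Medoids": 0.12392534797169114,
    "Hierarchical (ward)": 0.30535809430078065,
    "GMM": 0.3320312327281025,
}
comparison = (
    pd.DataFrame(list(scores.items()), columns=["Method", "Silhouette Score"])
    .sort_values("Silhouette Score", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```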
1. Comparison of various techniques and their relative performance based on chosen metric (measure of success):
It was interesting to see how each method performed and the clusters it found. K-means had the highest silhouette score, but I could not identify the underlying patterns that defined its clusters. Hierarchical clustering and GMM produced more sensible clusters. K-medoids was largely unhelpful, since its clusters didn't seem to have any defining features.
2. Refined insights:
I think the most meaningful insight is that a large swath of this company's customer base is low income and makes very few purchases. No offense to these customers, but they just aren't where the money is. I would put more effort into identifying and finding the customers who fit into the higher-spending clusters we built today.
3. Proposal for the final solution design:
I suggest the company adopt the hierarchical clustering model. I was choosing between it and the GMM model, and while the GMM model has a higher silhouette score and a slightly more balanced distribution, the hierarchical clusters make the most sense and give the clearest direction for where to focus resources. Cluster 4 is the cluster the company needs to mine and grow: take this profile and use it to target advertising toward potential customers who would fit into that cluster. After scouting out more customers like that, focus on in-store promotions for cluster 0, since it is a high-income, high-spending cluster with a larger number of customers in it.
To summarize, we want to grow by spending on advertising to attract leads similar to the cluster 4 profile, and we want to focus our customer retention efforts on customers who fit into cluster 0.
As evidenced above, we have wasted a lot of resources on marketing campaigns that our customers ignore because we're not listening to them. We need to hear what they say, watch how they behave, and meet them where they are. To that end, here are our proposed actionable steps:
The benefits of this solution are that we will be much more efficient with our resources. Under our prior strategy, our accepted-campaign rate was roughly 9% because so many offers went to people who were never going to purchase anything in the first place. By using our clusters, we believe the accepted-offer rate will at least double, to almost 20%. We will be reaching fewer people, but every dollar spent will have a higher ROI. Our marketing campaign to increase loyalty among cluster 0 is estimated to increase retention by 10% and thus overall revenue by 3%.
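The projected uplift can be sanity-checked with a back-of-envelope calculation. Only the 9% and 20% acceptance rates come from the analysis above; the campaign size and margin per acceptance are hypothetical figures chosen purely for illustration.

```python
# Illustrative ROI arithmetic for the targeted-campaign proposal.
# The campaign size and per-acceptance margin are assumed, not measured.
prior_accept_rate = 0.09      # rate under the old untargeted strategy
projected_accept_rate = 0.20  # rate projected for cluster-targeted offers

offers_sent = 10_000            # hypothetical campaign size
avg_margin_per_acceptance = 50  # hypothetical dollars of margin per accepted offer

prior_return = offers_sent * prior_accept_rate * avg_margin_per_acceptance
projected_return = offers_sent * projected_accept_rate * avg_margin_per_acceptance

print(f"Prior expected return:     ${prior_return:,.0f}")
print(f"Projected expected return: ${projected_return:,.0f}")
print(f"Uplift: {projected_return / prior_return - 1:.0%}")
```

Because the acceptance rate more than doubles, the expected return per campaign dollar more than doubles regardless of the assumed campaign size or margin.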
The costs of this solution are a slight loss in revenue from cluster 3, because we will not advertise to them as much, though overall that will save us money. The other cost is an increased budget for the data collection team, which will need extra time to institute the system for tracking time of purchase.
The risk of our solution is that we are focusing on a relatively small group of customers. This group only has 85 customers in it right now, so we can't be sure their behavior is scalable. While growing that cluster could prove more difficult than outlined in this proposal, I believe it is worth the risk because their impact is comparatively outsized.
We believe that with our proposed solution we can consistently increase profits by focusing on the customers who want to make purchases from our store, but that does not mean this report completes our analysis as a company. To consistently increase our profits and grow as a business, we need to keep investing in our understanding of our customers and the kinds of transactions they make. We need to continually experiment with prices to learn where our customers' thresholds are and what they value. We need to run campaigns to see what customers want from us: do they value lower prices, or do they value the experience of shopping in our store more? Do they prefer shopping online or in person? With more information, such as time of purchase and everything else outlined here, we can further optimize our store and continue to grow.